System for Recognising and Classifying Named Entities

Field of the Invention 

The invention relates to Named Entity Recognition (NER), and in particular to 
automatic learning of patterns. 

Background 

Named Entity Recognition is used in natural language processing and information 
retrieval to recognise names (Named Entities (NEs)) within text and to classify the names 
within predefined categories, e.g. "person names", "location names", "organisation 
names", "dates", "times", "percentages", "money amounts", etc. (usually also with a 
catch-all category "others" for words which do not fit into any of the more specific 
categories). Within computational linguistics, NER is part of information extraction,
which extracts specific kinds of information from a document. With Named Entity
Recognition, the specific information is entity names, which form a main component of 
the analysis of a document, for instance for database searching. As such, accurate naming 
is important. 

Sentence elements can be partially viewed in terms of questions, such as the
"who", "where", "how much", "what" and "how" of a sentence. Named Entity
Recognition performs surface parsing of text, delimiting sequences of tokens that answer
some of these questions, for instance the "who", "where" and "how much". For this
purpose a token may be a word, a sequence of words, an ideographic character or a 
sequence of ideographic characters. This use of Named Entity Recognition can be the 
first step in a chain of processes, with the next step relating two or more NEs, possibly 
even giving semantics to that relationship using a verb. Further processing is then able to
discover the more difficult questions to answer, such as the "what" and "how" of a text. 

It is fairly simple to build a Named Entity Recognition system with reasonable 
performance. However, there are still many inaccuracies and ambiguous cases (for 
instance, is "June" a person or a month? Is "pound" a unit of weight or currency? Is
"Washington" a person's name, a US state or a town in the UK or a city in the USA?).
The ultimate aim is to achieve human performance or better.

WO 2005/064490    PCT/SG2003/000299

Previous approaches to Named Entity Recognition constructed finite state patterns 
manually. Using such systems attempts are made to match these patterns against a 
sequence of words, in much the same way as a general regular expression matcher. Such 
systems are mainly rule based and lack the ability to cope with the problems of robustness 
and portability. Each new source of text tends to require changes to the rules, to maintain 
performance, and thus such systems require significant maintenance. However, when the 
systems are maintained, they do work quite well.

More recent approaches tend to use machine-learning. Machine learning systems 
are trainable and adaptable. Within machine-learning, there have been many different 
approaches, for example: (i) maximum entropy; (ii) transformation-based learning rules;
(iii) decision trees; and (iv) Hidden Markov Models.

Among these approaches, the evaluation performance of a Hidden Markov Model 
tends to be better than that of the others. The main reason for this is possibly the ability 
of a Hidden Markov Model to capture the locality of phenomena, which indicates names 
in text. Moreover, a Hidden Markov Model can take advantage of the efficiency of the 
Viterbi algorithm in decoding the NE-class state sequence. 

Various Hidden Markov Model approaches are described in: 

Bikel Daniel M., Schwartz R. and Weischedel Ralph M. 1999. An algorithm that
learns what's in a name. Machine Learning (Special Issue on NLP);

Miller S., Crystal M., Fox H., Ramshaw L., Schwartz R., Stone R., Weischedel R.
and the Annotation Group. 1998. BBN: Description of the SIFT system as used for
MUC-7. MUC-7. Fairfax, Virginia;

United States Patent No. 6,052,682, issued on 18 April 2000 to Miller S. et al.
Method of and apparatus for recognizing and labeling instances of name classes in textual
environments (which is related to the systems in both the Bikel and Miller documents
above);

Yu Shihong, Bai Shuanhu and Wu Paul. 1998. Description of the Kent Ridge
Digital Labs system used for MUC-7. MUC-7. Fairfax, Virginia;

United States Patent No. 6,311,152, issued on 30 October 2001 to Bai Shuanhu et
al. System for Chinese tokenization and named entity recognition, which resolves named
entity recognition as a part of word segmentation (and which is related to the system
described in the Yu document above); and

Zhou GuoDong and Su Jian. 2002. Named Entity Recognition using an HMM-based
Chunk Tagger. Proceedings of the 40th Annual Meeting of the Association for
Computational Linguistics (ACL), Philadelphia, July 2002, pp. 473-480.

One approach within those using Hidden Markov Models relies on using two 
kinds of evidence to solve ambiguity, robustness and portability problems. The first kind 
of evidence is the internal evidence found within the word and/or word string itself. The
second kind of evidence is the external evidence gathered from the context of the word 
and/or word string. This approach is described in "Zhou GuoDong and Su Jian. 2002. 
Named Entity Recognition using an HMM-based Chunk Tagger", mentioned above. 

Summary 

According to one aspect of the invention, there is provided a method of back-off
modelling for use in named entity recognition of a text, comprising, for an initial pattern 
entry from the text: relaxing one or more constraints of the initial pattern entry; 
determining if the pattern entry after constraint relaxation has a valid form; and moving 
iteratively up the semantic hierarchy of the constraint if the pattern entry after constraint 
relaxation is determined not to have a valid form. 

According to another aspect of the invention, there is provided a method of
inducing patterns in a pattern lexicon comprising a plurality of initial pattern entries with
associated occurrence frequencies, the method comprising: identifying one or more initial
pattern entries in the lexicon with lower occurrence frequencies; and relaxing one or more
constraints of individual ones of the identified one or more initial pattern entries to
broaden the coverage of the identified one or more initial pattern entries.

According to yet another aspect of the invention, there is provided a system for
recognising and classifying named entities within a text, comprising: feature extraction 
means for extracting various features from the document; recognition kernel means to 
recognise and classify named entities using a Hidden Markov Model; and back-off 
modelling means for back-off modelling by constraint relaxation to deal with data 
sparseness in a rich feature space. 

According to a further aspect of the invention, there is provided a feature set for 
use in back-off modelling in a Hidden Markov Model, during named entity recognition, 
wherein the feature sets are arranged hierarchically to allow for data sparseness. 

Introduction to the Drawings 

The invention is further described by way of non-limitative example with 
reference to the accompanying drawings, in which:- 

Figure 1 is a schematic view of a named entity recognition system according to an 
embodiment of the invention; 

Figure 2 is a flow diagram relating to an exemplary operation of the Named Entity 
Recognition system of Figure 1; 

Figure 3 is a flow diagram relating to the operation of a Hidden Markov Model of 
an embodiment of the invention; 

Figure 4 is a flow diagram relating to determining a lexical component of the
Hidden Markov Model of an embodiment of the invention;

Figure 5 is a flow diagram relating to relaxing constraints within the determination
of the lexical component of the Hidden Markov Model of an embodiment of the
invention; and

Figure 6 is a flow diagram relating to inducing patterns in a pattern dictionary of
an embodiment of the invention. 

Detailed Description 

According to a below-described embodiment, a Hidden Markov Model is used in 
Named Entity Recognition (NER). Using the constraint relaxation principle, a pattern 
induction algorithm is presented in the training process to induce effective patterns. The
induced patterns are then used in the recognition process by a back-off modelling
algorithm to resolve the data sparseness problem. Various features are structured 
hierarchically to facilitate the constraint relaxation process. In this way, the data 
sparseness problem in named entity recognition can be resolved effectively and a named 
entity recognition system with better performance and better portability can be achieved. 

Figure 1 is a schematic block diagram of a named entity recognition system 10 
according to an embodiment of the invention. The named entity recognition system 10 
includes a memory 12 for receiving and storing a text 14 input through an in/out port 16
from a scanner, the Internet or some other network or some other external means. The
memory can also receive text directly from a user interface 18. The named entity
recognition system 10 uses a named entity processor 20 including a Hidden Markov
Model module 22, to recognise named entities in received text, with the help of entries in
a lexicon 24, a feature set determination module 26 and a pattern dictionary 28, which are
all interconnected in this embodiment in a bus manner.

In Named Entity Recognition a text to be analysed is input to a Named Entity 
(NE) processor 20 to be processed and labelled with tags according to relevant categories.
The Named Entity processor 20 uses statistical information from a lexicon 24 and an
n-gram model to provide parameters to a Hidden Markov Model 22. The Named Entity
processor 20 uses the Hidden Markov Model 22 to recognise and label instances of 
different categories within the text. 

Figure 2 is a flow diagram relating to an exemplary operation of the Named Entity 
Recognition system 10 of Figure 1. A text comprising a word sequence is input and
stored to memory (step S42). From the text a feature set F, of features for each word in
the word sequence, is generated (step S44), which, in turn, is used to generate a token
sequence G of words and their associated features (step S46). The token sequence G is
fed to the Hidden Markov Model (step S48), which outputs a result in the form of an
optimal tag sequence T (step S50), using the Viterbi algorithm.

A described embodiment of the invention uses HMM-based tagging to model a 
text chunking process, involving dividing sentences into non-overlapping segments, in
this case noun phrases. 

Determination of Features for Feature Set 

The token sequence G ($G_1^n = g_1 g_2 \cdots g_n$) is the observation sequence provided to
the Hidden Markov Model, where any token $g_i$ is denoted as an ordered pair of a word
$w_i$ itself and its related feature set $f_i$: $g_i = \langle f_i, w_i \rangle$. The feature set is gathered from
simple deterministic computation on the word and/or word string with appropriate
consideration of context as looked up in the lexicon or added to the context.

The feature set of a word includes several features, which can be classified into 
internal features and external features. The internal features are found within the word
and/or word string to capture internal evidence while external features are derived within
the context to capture external evidence. Moreover, all the internal and external features,
including the words themselves, are classified hierarchically to deal with any data
sparseness problem and can be represented by any node (word/feature class) in the
hierarchical structure. In this embodiment, two or three-level structures are applied.
However, the hierarchical structure can be of any depth. 

(A) Internal features 

The embodiment of this model captures three types of internal features:

i) $f^1$: simple deterministic internal feature of the words;

ii) $f^2$: internal semantic feature of important triggers; and

iii) $f^3$: internal gazetteer feature.

i) $f^1$ is the basic feature exploited in this model and organised into two levels: the
small classes in the lower level are further clustered into the big classes (e.g.
"Digitalisation" and "Capitalisation") in the upper level, as shown in Table 1.



Table 1: Feature $f^1$: simple deterministic internal feature of words

Upper Level      Lower Level (hierarchical feature $f^1$)   Example                  Explanation
---------------  -----------------------------------------  -----------------------  ------------------------------------
Digitalisation   ContainDigitAndAlpha                       A8956-67
                 YearFormat - TwoDigits                     90                       Two-Digit year
                 YearFormat - FourDigits                    1990
                 YearDecade                                 90s, 1990s               Year Decade
                 DateFormat - ContainDigitDash              09-99
                 DateFormat - ContainDigitSlash             19/09/99                 Date
                 NumberFormat - ContainDigitComma           19,000
                 NumberFormat - ContainDigitPeriod          1.00                     Money, Percentage
                 NumberFormat - ContainDigitOthers          123                      Other Number
Capitalisation   AllCaps                                    IBM                      Organisation
                 ContainCapPeriod - CapPeriod               M.                       Person Name Initial
                 ContainCapPeriod - CapPlusPeriod           St.                      Abbreviation
                 ContainCapPeriod - CapPeriodPlus           N.Y.                     Abbreviation
                 FirstWord                                  First word of sentence   No useful capitalisation information
                 InitialCap                                 Microsoft                Capitalised Word
                 LowerCase                                  will                     Un-capitalised Word
Other            Other                                      $                        All other words


The rationale behind this feature is that a) numeric symbols can be grouped into
categories; and b) in Roman and certain other script languages capitalisation gives good
evidence of named entities. As for ideographic languages, such as Chinese and Japanese,
where capitalisation is not available, $f^1$ can be altered from Table 1 by discarding
"FirstWord", which is not available, and combining "AllCaps", "InitialCap", the various
"ContainCapPeriod" sub-classes, "FirstWord" and "LowerCase" into a new class
"Ideographic", which includes all the normal ideographic characters/words, while "Other"
would include all the symbols and punctuation.
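
As an illustration only, the two-level classification of Table 1 can be sketched as an ordered list of regular expressions. The rule order, the exact patterns and the function name `classify_f1` are assumptions made for this sketch and are not taken from the specification; in particular, the sketch applies "FirstWord" only to words that would otherwise be "InitialCap".

```python
import re

# (lower-level class, upper-level class, regex); the first full match wins.
# The patterns are illustrative guesses that reproduce the examples of Table 1.
F1_RULES = [
    ("YearFormat-TwoDigits", "Digitalisation", r"\d{2}"),
    ("YearFormat-FourDigits", "Digitalisation", r"\d{4}"),
    ("YearDecade", "Digitalisation", r"(?:\d{2}|\d{4})s"),
    ("DateFormat-ContainDigitDash", "Digitalisation", r"\d+(?:-\d+)+"),
    ("DateFormat-ContainDigitSlash", "Digitalisation", r"\d+(?:/\d+)+"),
    ("NumberFormat-ContainDigitComma", "Digitalisation", r"\d+(?:,\d+)+"),
    ("NumberFormat-ContainDigitPeriod", "Digitalisation", r"\d*\.\d+"),
    ("NumberFormat-ContainDigitOthers", "Digitalisation", r"\d+"),
    ("ContainDigitAndAlpha", "Digitalisation", r"(?=.*\d)(?=.*[A-Za-z]).+"),
    ("AllCaps", "Capitalisation", r"[A-Z]{2,}"),
    ("ContainCapPeriod-CapPeriod", "Capitalisation", r"[A-Z]\."),
    ("ContainCapPeriod-CapPlusPeriod", "Capitalisation", r"[A-Z][a-z]+\."),
    ("ContainCapPeriod-CapPeriodPlus", "Capitalisation", r"(?:[A-Z]\.){2,}"),
    ("InitialCap", "Capitalisation", r"[A-Z][a-z]+"),
    ("LowerCase", "Capitalisation", r"[a-z]+"),
]


def classify_f1(word, is_first_word=False):
    """Return (lower_level, upper_level) of the f1 feature for one word."""
    for lower, upper, pattern in F1_RULES:
        if re.fullmatch(pattern, word):
            # A sentence-initial capital carries no capitalisation evidence.
            if is_first_word and lower == "InitialCap":
                return ("FirstWord", "Capitalisation")
            return (lower, upper)
    return ("Other", "Other")
```

Keeping the upper-level class next to each lower-level class means a later relaxation step can replace the lower-level value by its parent without a second lookup.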

ii) $f^2$ is organised into two levels: the small classes in the lower level are further
clustered into the big classes in the upper level, as shown in Table 2.



Table 2: Feature $f^2$: the semantic classification of important triggers

Upper Level   Lower Level                        Example     Explanation
(NEType)      (hierarchical feature $f^2$)       (trigger)
------------  ---------------------------------  ----------  ------------------------------
PERCENT       SuffixPERCENT                      %           Percentage Suffix
MONEY         PrefixMONEY                        $           Money Prefix
              SuffixMONEY                        Dollars     Money Suffix
DATE          SuffixDATE                         Day         Date Suffix
              WeekDATE                           Monday      Week Date
              MonthDATE                          July        Month Date
              SeasonDATE                         Summer      Season Date
              PeriodDATE - PeriodDATE1           Month       Period Date
              PeriodDATE - PeriodDATE2           Quarter     Quarter/Half of Year
              EndDATE                            Weekend     Date End
TIME          SuffixTIME                         a.m.        Time Suffix
              PeriodTime                         Morning     Time Period
PERSON        PrefixPerson - PrefixPERSON1       Mr.         Person Title
              PrefixPerson - PrefixPERSON2       President   Person Designation
              NamePerson - FirstNamePERSON       Michael     Person First Name
              NamePerson - LastNamePERSON        Wong        Person Last Name
              OthersPERSON                       Jr.         Person Name Initial
LOC           SuffixLOC                          River       Location Suffix
ORG           SuffixORG - SuffixORGCom           Ltd         Company Name Suffix
              SuffixORG - SuffixORGOthers        Univ.       Other Organisation Name Suffix
NUMBER        Cardinal                           Six         Cardinal Numbers
              Ordinal                            Sixth       Ordinal Numbers
OTHER         Determiner, etc                    the         Determiner


$f^2$ in this underlying Hidden Markov Model is based on the rationale that
important triggers are useful for named entity recognition and can be classified according
to their semantics. This feature applies to both single words and multiple words. This set
of triggers is collected semi-automatically from the named entities themselves and their
local context within training data. This feature applies to both Roman and ideographic
languages. The trigger effect is used as a feature in the feature set of g.

iii) $f^3$ is organised into two levels. The lower level is determined by both the
named entity type and the length of the named entity candidate while the upper level is
determined by the named entity type only, as shown in Table 3.



Table 3: Feature $f^3$: the internal gazetteer feature
(G: Global gazetteer; and n: the length of the matched named entity)

Upper Level   Lower Level                      Example
(NEType)      (hierarchical feature $f^3$)
------------  -------------------------------  ------------------------
DATEG         DATEGn                           Christmas Day: DATEG2
PERSONG       PERSONGn                         Bill Gates: PERSONG2
LOCG          LOCGn                            Beijing: LOCG1
ORGG          ORGGn                            United Nations: ORGG2




$f^3$ is gathered from various look-up gazetteers: lists of names of persons,
organisations, locations and other kinds of named entities. This feature determines 
whether and how a named entity candidate occurs in the gazetteers. This feature applies 
to both Roman and ideographic languages. 
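
A minimal sketch of the gazetteer look-up behind Table 3 follows, assuming each gazetteer is held as a set of full names keyed by named entity type. The function name `gazetteer_feature` and the toy `GAZETTEERS` data are hypothetical and serve only to reproduce the examples in the table.

```python
def gazetteer_feature(candidate, gazetteers):
    """Return (lower_level, upper_level), e.g. ("PERSONG2", "PERSONG"), or None.

    candidate:  list of words forming the named entity candidate
    gazetteers: dict mapping an entity type to a set of known names
    """
    name = " ".join(candidate)
    for ne_type, names in gazetteers.items():
        if name in names:
            upper = f"{ne_type}G"                # G marks the global gazetteer
            return (f"{upper}{len(candidate)}", upper)
    return None


# Toy gazetteers containing only the examples of Table 3.
GAZETTEERS = {
    "DATE": {"Christmas Day"},
    "PERSON": {"Bill Gates"},
    "LOC": {"Beijing"},
    "ORG": {"United Nations"},
}
```

The lower-level value carries the length n of the match, while the upper-level value keeps only the entity type, mirroring the two levels described above.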

(B) External features 

The embodiment of this model captures one type of external feature:
iv) $f^4$: external discourse feature.

iv) $f^4$ is the only external evidence feature captured in this embodiment of the
model. $f^4$ determines whether and how a named entity candidate has occurred in a list of
named entities already recognised from the document.

$f^4$ is organised into three levels, as shown in Table 4:

1) The lower level is determined by named entity type, the length of named
entity candidate, the length of the matched named entity in the recognised
list and the match type.

2) The middle level is determined by named entity type and whether it is a
full match or not.

3) The upper level is determined by named entity type only.

Table 4: Feature $f^4$: the external discourse feature (those features not found in a
Lexicon)
(L: Local document; n: the length of the matched named entity in the recognised list; m:
the length of named entity candidate; Ident: Full Identity; and Acro: Acronym)

Upper Level  Middle Level       Lower Level                    Example                      Explanation
(NEType)     (Match Type)       (hierarchical feature $f^4$)
-----------  -----------------  -----------------------------  ---------------------------  ----------------------------------------------
PERSON       PERLFullMatch      PERLIdentn                     Bill Gates: PERLIdent2       Full identity person name
                                PERLAcron                      G. D. ZHOU: PERLAcro3        Person acronym for "Guo Dong ZHOU"
             PERLPartialMatch   PERLLastNamnm                  Jordan: PERLLastNam21        Personal last name for "Michael Jordan"
                                PERLFirstNamnm                 Michael: PERLFirstNam21      Personal first name for "Michael Jordan"
ORG          ORGLFullMatch      ORGLIdentn                     Dell Corp.: ORGLIdent2       Full identity org name
                                ORGLAcron                      NUS: ORGLAcro3               Org acronym for "National Univ. of Singapore"
             ORGLPartialMatch   ORGLPartialnm                  Harvard: ORGLPartial21       Partial match for org "Harvard Univ."
LOC          LOCLFullMatch      LOCLIdentn                     New York: LOCLIdent2         Full identity location name
                                LOCLAcron                      N.Y.: LOCLAcro2              Location acronym for "New York"
             LOCLPartialMatch   LOCLPartialnm                  Washington: LOCLPartial21    Partial match for location "Washington D.C."



$f^4$ is unique to this underlying Hidden Markov Model. The rationale behind this
feature is the phenomenon of name aliases, by which application-relevant entities are
referred to in many ways throughout a given text. Because of this phenomenon, the
success of the named entity recognition task is conditional on the success in determining
when one noun phrase refers to the same entity as another noun phrase. In this
embodiment, name aliases are resolved in the following ascending order of complexity:

1) The simplest case is to recognise the full identity of a string. This case is 
possible for all types of named entities. 

2) The next simplest case is to recognise the various forms of location names.
Normally, various acronyms are applied, e.g. "NY" vs. "New York" and
"N.Y." vs. "New York". Sometimes, a partial mention is also used, e.g.
"Washington" vs. "Washington D.C.".

3) The third case is to recognise the various forms of personal proper names.
Thus an article on Microsoft may include "Bill Gates", "Bill" and "Mr.
Gates". Normally, the full personal name is mentioned first in a document
and later mention of the same person is replaced by various short forms
such as an acronym, the last name and, to a lesser extent, the first name, or
the full person name.

4) The most difficult case is to recognise the various forms of organisational
names. For various forms of company names, consider a) "International
Business Machines Corp.", "International Business Machines" and "IBM";
b) "Atlantic Richfield Company" and "ARCO". Normally, various
abbreviated forms (e.g. contractions or acronyms) occur and/or the
company suffix or suffixes are dropped. For various forms of other
organisation names, consider a) "National University of Singapore",
"National Univ. of Singapore" and "NUS"; b) "Ministry of Education" and
"MOE". Normally, acronyms and abbreviation of some long words occur.

During decoding, that is the processing procedure of the Named Entity processor, 
the named entities already recognised from the document are stored in a list. If the system
encounters a named entity candidate (e.g. a word or sequence of words with an initial
letter capitalised), the above name alias algorithm is invoked to determine dynamically if
the named entity candidate might be an alias for a previously recognised name in the
recognised list and the relationship between them. This feature applies to both Roman
and ideographic languages.

For example, if the decoding process encounters the word "UN", the word "UN"
is proposed as an entity name candidate and the name alias algorithm is invoked to check
if the word "UN" is an alias of a recognised entity name by taking the initial letters of a
recognised entity name. If "United Nations" is an organisation entity name recognised
earlier in the document, the word "UN" is determined as an alias of "United Nations"
with the external macro context feature ORG2L2.
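
The alias check described above might be sketched as follows. This is a deliberately simplified illustration, not the name alias algorithm of the embodiment: it tests only exact identity, acronyms formed from the initial letter of every word, and leading or trailing partial matches. The function name `alias_feature` and the returned tuple layout `(ne_type, match_kind, n, m)` are assumptions; the tuple is used instead of a label string so that no particular label spelling is implied.

```python
def _initials(name):
    """Upper-cased initial letters of every word in a name."""
    return "".join(word[0].upper() for word in name.split())


def alias_feature(candidate, recognised):
    """Relate a candidate string to the names already recognised in a document.

    recognised: list of (name, ne_type) pairs, in order of recognition.
    Returns (ne_type, match_kind, n, m) or None, where n is the word length of
    the matched entity and m is the word length of the candidate.
    """
    cand_words = candidate.split()
    m = len(cand_words)
    compact = candidate.replace(".", "").replace(" ", "").upper()
    for name, ne_type in recognised:
        words = name.split()
        n = len(words)
        if candidate == name:                        # full identity
            return (ne_type, "Ident", n, m)
        if n > 1 and compact == _initials(name):     # acronym of initials
            return (ne_type, "Acro", n, m)
        if m < n and ne_type == "PERSON":            # partial person name
            if cand_words == words[-m:]:
                return (ne_type, "LastName", n, m)
            if cand_words == words[:m]:
                return (ne_type, "FirstName", n, m)
        elif m < n and cand_words == words[:m]:      # other partial mention
            return (ne_type, "Partial", n, m)
    return None
```

For instance, with `[("United Nations", "ORG")]` as the recognised list, the candidate "UN" is related to "United Nations" as an acronym. Cases such as "NUS" for "National Univ. of Singapore", where a function word is skipped, are outside this sketch.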

The Hidden Markov Model (HMM)

The input to the Hidden Markov Model includes one sequence: the observation
token sequence G. The goal of the Hidden Markov Model is to decode a hidden tag
sequence T given the observation sequence G. Thus, given a token sequence
$G_1^n = g_1 g_2 \cdots g_n$, the goal is, using chunk tagging, to find a stochastic optimal tag
sequence $T_1^n = t_1 t_2 \cdots t_n$ that maximises

$$\log P(T_1^n \mid G_1^n) = \log P(T_1^n) + \log \frac{P(T_1^n, G_1^n)}{P(T_1^n)\,P(G_1^n)} \qquad (1)$$

The token sequence $G_1^n = g_1 g_2 \cdots g_n$ is the observation sequence provided to the Hidden
Markov Model, where $g_i = \langle f_i, w_i \rangle$, $w_i$ is the initial i-th input word and $f_i$ is a set of
determined features related to the word $w_i$. Tags are used to bracket and differentiate
various kinds of chunks.



The second term on the right-hand side of equation (1),
$\log \frac{P(T_1^n, G_1^n)}{P(T_1^n)\,P(G_1^n)}$, is the
mutual information between $T_1^n$ and $G_1^n$. To simplify the computation of this item, mutual
information independence (that an individual tag is only dependent on the token sequence
$G_1^n$ and independent of other tags in the tag sequence $T_1^n$) is assumed:

$$MI(T_1^n, G_1^n) = \sum_{i=1}^{n} MI(t_i, G_1^n) \qquad (2)$$

Applying equation (2) to equation (1), provides:

$$\log P(T_1^n \mid G_1^n) = \log P(T_1^n) + \sum_{i=1}^{n} \log \frac{P(t_i, G_1^n)}{P(t_i)\,P(G_1^n)} \qquad (3)$$

and from this,

$$\log P(T_1^n \mid G_1^n) = \log P(T_1^n) - \sum_{i=1}^{n} \log P(t_i) + \sum_{i=1}^{n} \log P(t_i \mid G_1^n) \qquad (4)$$

Thus the aim is to maximise equation (4).
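
The step leading to equation (4) uses only the definition of conditional probability. Written out for a single summand of the preceding equation, as a worked step added here for clarity:

```latex
\log \frac{P(t_i, G_1^n)}{P(t_i)\,P(G_1^n)}
  = \log \frac{P(t_i \mid G_1^n)\,P(G_1^n)}{P(t_i)\,P(G_1^n)}
  = \log P(t_i \mid G_1^n) - \log P(t_i)
```

Summing this identity over $i = 1, \ldots, n$ yields the second and third terms on the right-hand side of equation (4).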

The basic premise of this model is to consider the raw text, encountered when 
decoding, as though the text had passed through a noisy channel, where the text had been
originally marked with Named Entity tags. The aim of this generative model is to
generate the original Named Entity tags directly from the output words of the noisy
channel. This is the reverse of the generative model as used in some of the Hidden 
Markov Model related prior art. Traditional Hidden Markov Models assume conditional 
probability independence. However, the assumption of equation (2) is looser than this 
traditional assumption. This allows the model used here to apply more context 
information to determine the tag of a current token. 

Figure 3 is a flow diagram relating to the operation of a Hidden Markov Model of 
an embodiment of the invention. In step S102, n-gram modelling is used to compute the
first term on the right-hand side of equation (4). In step S104, n-gram modelling, where n
= 1, is used to compute the second term on the right-hand side of equation (4). In step
S106, pattern induction is used to train a model for use in determining the third term on
the right-hand side of equation (4). In step S108, back-off modelling is used to compute
the third term on the right-hand side of equation (4).
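
To make the role of the three terms concrete, the following is a minimal Viterbi sketch that maximises equation (4) under the simplifying assumption of a bigram tag model (each tag depends only on the previous tag). The function name `viterbi` and the three input tables are hypothetical; the specification does not prescribe these data structures.

```python
import math


def viterbi(n, tags, log_trans, log_prior, log_lex):
    """Return the tag sequence of length n that maximises equation (4).

    log_trans[(prev, cur)]: log P(cur | prev); prev is None at position 0
                            (first term, bigram assumption)
    log_prior[tag]:         log P(tag)              (second term, subtracted)
    log_lex[i][tag]:        log P(tag | G) at position i   (third term)
    Missing transition or lexical entries are treated as impossible.
    """
    def local(i, prev, cur):
        return (log_trans.get((prev, cur), -math.inf)
                - log_prior[cur]
                + log_lex[i].get(cur, -math.inf))

    best = {t: local(0, None, t) for t in tags}
    back = []                                   # back-pointers per position
    for i in range(1, n):
        new_best, pointers = {}, {}
        for cur in tags:
            prev = max(tags, key=lambda p: best[p] + local(i, p, cur))
            new_best[cur] = best[prev] + local(i, prev, cur)
            pointers[cur] = prev
        best = new_best
        back.append(pointers)

    last = max(tags, key=lambda t: best[t])
    path = [last]
    for pointers in reversed(back):             # follow back-pointers
        path.append(pointers[path[-1]])
    return path[::-1]
```

Because each position contributes a sum of three log terms, the dynamic programme keeps one running score per tag and its cost grows linearly with the length of the text.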

Within equation (4), the first term on the right-hand side, $\log P(T_1^n)$, can be
computed by applying chain rules. In n-gram modelling, each tag is assumed to be
probabilistically dependent on the N-1 previous tags.

Within equation (4), the second term on the right-hand side, $\sum_{i=1}^{n} \log P(t_i)$, is the
summation of log probabilities of all the individual tags. This term can be determined
using a uni-gram model.

Within equation (4), the third term on the right-hand side, $\sum_{i=1}^{n} \log P(t_i \mid G_1^n)$,
corresponds to the "lexical" component (dictionary) of the tagger.

Given the above Hidden Markov Model, for NE-chunk tagging,

token $g_i = \langle f_i, w_i \rangle$,

where $W_1^n = w_1 w_2 \cdots w_n$ is the word sequence, $F_1^n = f_1 f_2 \cdots f_n$ is the feature set sequence and
$f_i$ is a set of features related with the word $w_i$.
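
The first two terms of equation (4) can be estimated from tagged training data alone. The sketch below assumes N = 2 (a bigram tag model), plain maximum-likelihood counts without smoothing, and a start symbol `"<s>"`; all three choices, and the function names, are assumptions made for illustration.

```python
import math
from collections import Counter


def train_tag_models(tag_sequences):
    """Maximum-likelihood unigram and bigram tag models (no smoothing)."""
    unigram, bigram, context = Counter(), Counter(), Counter()
    for tags in tag_sequences:
        prev = "<s>"                         # assumed sentence-start symbol
        for tag in tags:
            unigram[tag] += 1
            bigram[(prev, tag)] += 1
            context[prev] += 1
            prev = tag
    total = sum(unigram.values())
    log_uni = {t: math.log(c / total) for t, c in unigram.items()}
    log_bi = {k: math.log(c / context[k[0]]) for k, c in bigram.items()}
    return log_uni, log_bi


def first_two_terms(tags, log_uni, log_bi):
    """log P(T) by the chain rule, minus the sum of log P(t_i)."""
    history = ["<s>"] + list(tags[:-1])
    log_p_sequence = sum(log_bi[(p, t)] for p, t in zip(history, tags))
    return log_p_sequence - sum(log_uni[t] for t in tags)
```

In practice unseen tag pairs would need smoothing; the sketch raises a `KeyError` for them rather than guessing a value.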

Further, the NE-chunk tag, $t_i$, is structural and includes three parts:

1) Boundary category: B = {0, 1, 2, 3}. Here 0 means that the current word, $w_i$,
is a whole entity and 1/2/3 means that the current word, $w_i$, is at the
beginning/in the middle/at the end of an entity name, respectively.

2) Entity category: E. E is used to denote the class of the entity name.

3) Feature set: F. Because of the limited number of boundary and entity
categories, the feature set is added into the structural named entity chunk tag
to represent more accurate models.

For example, in an initial input text "... Institute for Infocomm Research ...",
there exists a hidden tag sequence (to be decoded by the Named Entity processor) "...
1_ORG_* 2_ORG_* 2_ORG_* 3_ORG_* ..." (where * represents the feature set F). Here,
"Institute for Infocomm Research" is the entity name (as can be constructed from the
hidden tag sequence), "Institute"/"for"/"Infocomm"/"Research" are at the beginning/in
the middle/in the middle/at the end of the entity name, respectively, with the entity
category of ORG.

There are constraints between sequential tags $t_{i-1}$ and $t_i$ within the Boundary
Category, BC, and the Entity Category, EC. These constraints are shown in Table 5,
where "Valid" means the tag sequence $t_{i-1} t_i$ is valid, "Invalid" means the tag sequence
$t_{i-1} t_i$ is invalid, and "Valid on" means the tag sequence $t_{i-1} t_i$ is valid as long as $EC_{i-1} = EC_i$
(that is, the EC for $t_{i-1}$ is the same as the EC for $t_i$).



Table 5: Constraints between $t_{i-1}$ and $t_i$

                    BC in $t_i$
BC in $t_{i-1}$     0          1          2          3
------------------  ---------  ---------  ---------  ---------
0                   Valid      Valid      Invalid    Invalid
1                   Invalid    Invalid    Valid on   Valid on
2                   Invalid    Invalid    Valid on   Valid on
3                   Valid      Valid      Invalid    Invalid
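
The constraints of Table 5 and the boundary codes of the NE-chunk tag translate directly into code. The sketch below represents a tag as a `(boundary, entity)` pair and leaves out the feature set part F; the function names are hypothetical.

```python
def is_valid_transition(prev, cur):
    """Check a pair of consecutive tags against Table 5.

    prev and cur are (boundary, entity) pairs; boundary is 0, 1, 2 or 3.
    """
    (prev_bc, prev_ec), (cur_bc, cur_ec) = prev, cur
    if prev_bc in (0, 3):              # previous entity is complete,
        return cur_bc in (0, 1)        # so a new entity must start here
    # prev_bc is 1 or 2: inside an entity, which must continue with the same EC
    return cur_bc in (2, 3) and prev_ec == cur_ec


def extract_entities(words, tags):
    """Group words into (entity_text, entity_category) using boundary codes."""
    entities, current = [], []
    for word, (bc, ec) in zip(words, tags):
        current.append(word)
        if bc in (0, 3):               # whole entity, or end of an entity
            entities.append((" ".join(current), ec))
            current = []
    return entities
```

Applied to the example above, the tags 1_ORG, 2_ORG, 2_ORG, 3_ORG over "Institute for Infocomm Research" form a valid sequence and yield a single ORG entity.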



Back-Off Modelling 



Given the model and the rich feature set above, one problem is how to
compute $\sum_{i=1}^{n} \log P(t_i \mid G_1^n)$, the third term on the right-hand side of equation (4) mentioned
earlier, when there is insufficient information. Ideally, there would be sufficient training
data for every event whose conditional probability it is wished to calculate.
Unfortunately, there is rarely enough training data to compute accurate probabilities when
decoding new data, especially considering the complex feature set described above.
Back-off modelling is therefore used in such circumstances as a recognition procedure.

The probability of tag $t_i$ given $G_1^n$ is $P(t_i \mid G_1^n)$. For efficiency, it is assumed
that $P(t_i \mid G_1^n) \approx P(t_i \mid E_i)$, where the pattern entry $E_i = g_{i-2}\, g_{i-1}\, g_i\, g_{i+1}\, g_{i+2}$ and $P(t_i \mid E_i)$
is the probability of tag $t_i$ related with $E_i$. The pattern entry $E_i$ is thus a limited length
token string, of five consecutive tokens in this embodiment. As each token is only a
single word, this assumption only considers the context in a limited sized window, in this
case of 5 words. As is indicated above, $g_i = \langle f_i, w_i \rangle$, where $w_i$ is the current word
itself and $f_i = \langle f_i^1, f_i^2, f_i^3, f_i^4 \rangle$ is the set of the internal and external features, in this
embodiment four of the features, described above. For convenience, $P(\cdot \mid E_i)$ is denoted
as the probability distribution of various NE-chunk tags related with the pattern entry $E_i$.
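
The five-token window can be sketched as below. How the embodiment treats positions near the start and end of the text is not stated here, so the padding value `pad` is an assumption, as is the function name `pattern_entries`.

```python
def pattern_entries(tokens, pad=None):
    """Yield E_i = (g_{i-2}, g_{i-1}, g_i, g_{i+1}, g_{i+2}) for every position i.

    Positions that fall outside the text are filled with `pad` (an assumption).
    """
    tokens = list(tokens)
    padded = [pad, pad] + tokens + [pad, pad]
    for i in range(len(tokens)):
        yield tuple(padded[i:i + 5])
```

Each token would in practice be a `(features, word)` pair; plain strings work equally well for trying the windowing on its own.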

Computing $P(\cdot \mid E_i)$ becomes a problem of finding an optimal frequently
occurring pattern entry $E_i^0$, which can be used to replace $P(\cdot \mid E_i)$ with $P(\cdot \mid E_i^0)$ reliably.
For this purpose, this embodiment uses a back-off modelling approach by constraint
relaxation. Here, the constraints include all the $f^1$, $f^2$, $f^3$, $f^4$ and $w$ (the subscripts
are omitted) in $E_i$. Faced with the large number of ways in which the constraints could
be relaxed, the challenge is how to avoid intractability and keep efficiency. Three
restrictions are applied in this embodiment to keep the relaxation process tractable and
manageable:

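(1) Constraint relaxation is done through iteratively moving up the semantic
hierarchy of the constraint. A constraint is dropped entirely from the
pattern entry if the root of the semantic hierarchy is reached.

(2) The pattern entry after relaxation should have a valid form, defined as
ValidEntryForm = { $f_{i-2} f_{i-1} f_i w_i$, $f_{i-1} f_i w_i f_{i+1}$, $f_i w_i f_{i+1} f_{i+2}$, $f_{i-1} f_i w_i$,
$f_i w_i f_{i+1}$, $f_{i-1} w_{i-1} f_i$, $f_i f_{i+1} w_{i+1}$, $f_{i-2} f_{i-1} f_i$, $f_{i-1} f_i f_{i+1}$, $f_i f_{i+1} f_{i+2}$, $f_i w_i$, ... }.

(3) Each $f_k$ in the pattern entry after relaxation should have a valid form,
defined as ValidFeatureForm = { $\langle f_k^1, f_k^2, f_k^3, f_k^4 \rangle$, $\langle f_k^1, \Theta, f_k^3, \Theta \rangle$,
$\langle f_k^1, \Theta, \Theta, f_k^4 \rangle$, $\langle f_k^1, f_k^2, \Theta, \Theta \rangle$, $\langle f_k^1, \Theta, \Theta, \Theta \rangle$ }, where $\Theta$ means empty
(dropped or not available).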
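
Restrictions (1) and (3) lend themselves to a short sketch. A constraint is relaxed by replacing its value with its parent in the hierarchy, and dropped once no parent exists; a feature set is then checked against the five valid feature forms. The parent map, the use of `None` for the empty marker and the function names are assumptions made for illustration.

```python
EMPTY = None  # stands for the "empty" marker (dropped or not available)

# Which of <f1, f2, f3, f4> are present in each valid feature form.
VALID_FEATURE_FORMS = {
    (True, True, True, True),
    (True, False, True, False),
    (True, False, False, True),
    (True, True, False, False),
    (True, False, False, False),
}


def relax(value, parent):
    """Move one step up the hierarchy; return EMPTY once past the root."""
    return parent.get(value, EMPTY)


def has_valid_feature_form(features):
    """features is a 4-tuple <f1, f2, f3, f4> with EMPTY for dropped slots."""
    return tuple(f is not EMPTY for f in features) in VALID_FEATURE_FORMS
```

Checking the form after every relaxation keeps the search space small, because relaxations that lead to a feature combination outside the five allowed forms are discarded immediately.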

The process embodied here solves the problem of computing $P(t_i \mid G_1^n)$ by
iteratively relaxing a constraint in the initial pattern entry $E_i$ until a near optimal
frequently occurring pattern entry $E_i^0$ is reached.

The process for computing $P(t_i \mid G_1^n)$ is discussed below with reference to the
flowchart in Figure 4. This process corresponds to step S108 of Figure 3. The process of
Figure 4 starts, at step S202, with the feature set $f_i = \langle f_i^1, f_i^2, f_i^3, f_i^4 \rangle$ being determined
for all $w_i$ within $G_1^n$. Although this step in this embodiment occurs within the step for
computing $P(t_i \mid G_1^n)$, that is step S108 of Figure 3, the operation of step S202 can occur
at an earlier point within the process of Figure 3, or entirely separately.

At step S204, for the current word, $w_i$, being processed to be recognised and
named, there is assumed a pattern entry $E_i = g_{i-2}\, g_{i-1}\, g_i\, g_{i+1}\, g_{i+2}$, where $g_i = \langle f_i, w_i \rangle$ and
$f_i = \langle f_i^1, f_i^2, f_i^3, f_i^4 \rangle$.

At step S206, the process determines if $E_i$ is a frequently occurring pattern entry.
That is, a determination is made as to whether $E_i$ has an occurrence frequency of at least
N, for example N may equal 10, with reference to a FrequentEntryDictionary. If $E_i$ is a
frequently occurring pattern entry (Y), at step S208 the process sets $E_i^0 = E_i$, and the
algorithm returns $P(t_i \mid G_1^n) = P(t_i \mid E_i^0)$, at step S210. At step S212, "i" is increased by
one and a determination is made at step S214, whether the end of the text has been
reached, i.e. whether i = n. If the end of the text has been reached (Y), the algorithm
ends. Otherwise the process returns to step S204 and assumes a new initial pattern entry,
based on the change in "i" in step S212.

If, at step S206, $E_i$ is not a frequently occurring pattern entry (N), at step S216 a 
valid set of pattern entries $C^1(E_i)$ can be generated by relaxing one of the constraints in 
the initial pattern entry $E_i$. Step S218 determines if there are any frequently occurring 
pattern entries within the constraint relaxed set of pattern entries. If there is one such 
entry, then that entry is chosen as $E_i^0$ and if there is more than one frequently occurring 
pattern entry, the frequently occurring pattern entry which maximises the likelihood 
measure is chosen as $E_i^0$, in step S220. The process reverts to step S210, where the 
algorithm returns $P(t_i / G_1^n) = P(t_i / E_i^0)$. 

If step S218 determines that there are no frequently occurring pattern entries in 
$C^1(E_i)$, the process reverts to step S216, where a further valid set of pattern entries 
$C^2(E_i)$ can be generated by relaxing one of the constraints in each pattern entry 
of $C^1(E_i)$. The process continues until a frequently occurring pattern entry $E_i^0$ is found 
within a constraint relaxed set of pattern entries. 
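
The flow of Figure 4 for a single word can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the embodiment's implementation: `back_off`, `relax_one`, `likelihood` and the `max_rounds` safety bound are hypothetical names, entries are any hashable values, and `frequent` stands in for the FrequentEntryDictionary. In the toy usage, relaxing a constraint simply drops it, and the likelihood is replaced by the number of remaining constraints.

```python
from typing import Callable, Collection, Hashable, Iterable, Optional


def back_off(
    entry: Hashable,
    frequent: Collection,
    relax_one: Callable[[Hashable], Iterable[Hashable]],
    likelihood: Callable[[Hashable], float],
    max_rounds: int = 50,  # assumed safety bound, not part of the described process
) -> Optional[Hashable]:
    """Return a frequently occurring entry reached by relaxation, or None."""
    if entry in frequent:  # steps S206 and S208
        return entry
    current = [entry]
    for _ in range(max_rounds):
        # Step S216: relax one constraint in every entry of the current set.
        relaxed = list(dict.fromkeys(r for e in current for r in relax_one(e)))
        if not relaxed:
            return None  # nothing left to relax
        # Step S218: look for frequently occurring entries in the relaxed set.
        hits = [e for e in relaxed if e in frequent]
        if hits:
            return max(hits, key=likelihood)  # step S220
        current = relaxed
    return None


def drop_one(entry: frozenset) -> list:
    """Toy relaxation: drop each remaining constraint in turn."""
    return [entry - {c} for c in sorted(entry)]


frequent = {frozenset({"f1"}), frozenset({"f1", "f2"})}
best = back_off(frozenset({"f1", "f2", "w"}), frequent, drop_one, likelihood=len)
```

Here `best` is the entry that keeps `f1` and `f2`, which is found after a single round of relaxation.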

The constraint relaxation algorithm in computing $P(t_i / G_1^n)$, in particular that 
relating to steps S216, S218 and S220 in Figure 4 above, is shown in more detail in 
Figure 5. 

The process of Figure 5 starts as if, at step S206 of Figure 4, $E_i$ is not a frequently 
occurring pattern entry. At step S302, the process initialises a pattern entry set before 
constraint relaxation $C_{IN} = \{\langle E_i, \mathrm{likelihood}(E_i) \rangle\}$ and a pattern entry set after 
constraint relaxation $C_{OUT} = \{\}$ (here, $\mathrm{likelihood}(E_i) = 0$). 

At step S304, for a first pattern entry $E_j$ within $C_{IN}$, that is 
$\langle E_j, \mathrm{likelihood}(E_j) \rangle \in C_{IN}$, a next constraint $C_j^k$ is relaxed (which in the 
first iteration of step S304 for any entry is the first constraint). The pattern entry $E_j$ 
after constraint relaxation becomes $E_j'$. Initially, there is only one such entry $E_j$ in 
$C_{IN}$. However, that changes over further iterations. 

At step S306, the process determines if $E_j'$ is in a valid entry form in 
ValidEntryForm, where ValidEntryForm = { $f_{i-2}f_{i-1}f_i w_i$, $f_{i-1}f_i w_i f_{i+1}$, 
$f_i w_i f_{i+1}f_{i+2}$, $f_{i-1}f_i w_i$, $f_i w_i f_{i+1}$, $f_{i-1}w_{i-1}f_i$, 
$f_i f_{i+1}w_{i+1}$, $f_{i-2}f_{i-1}f_i$, $f_{i-1}f_i f_{i+1}$, $f_i f_{i+1}f_{i+2}$, 
$f_i w_i$, $f_{i-1}f_i$, $f_i f_{i+1}$, $f_i$ }. If $E_j'$ is not in a valid entry form, the 
process reverts to step S304 and a next constraint is relaxed. If $E_j'$ is in a valid entry 
form, the process continues to step S308. 

At step S308, the process determines if each feature in $E_j'$ is in a valid feature set 
form, where ValidFeatureForm = { $\langle f_k^1, f_k^2, f_k^3, f_k^4 \rangle$, 
$\langle f_k^1, \Theta, f_k^3, \Theta \rangle$, $\langle f_k^1, \Theta, \Theta, f_k^4 \rangle$, 
$\langle f_k^1, f_k^2, \Theta, \Theta \rangle$, $\langle f_k^1, \Theta, \Theta, \Theta \rangle$ }. If $E_j'$ is not in a valid 
feature set form, the process reverts to step S304 and a next constraint is relaxed. If 
$E_j'$ is in a valid feature set form, the process continues to step S310. 

At step S310, the process determines if $E_j'$ exists in a dictionary. If $E_j'$ does 
exist in the dictionary (Y), at step S312 the likelihood of $E_j'$ is computed as 

$$\mathrm{likelihood}(E_j') = \frac{\text{number of } f^2, f^3 \text{ and } f^4 \text{ in } E_j' + 0.1}{\text{number of } f^1, f^2, f^3, f^4 \text{ and } w \text{ in } E_j'}$$ 

If $E_j'$ does not exist in the dictionary (N), at step S314 the likelihood of $E_j'$ is set as 
$\mathrm{likelihood}(E_j') = 0$. 

Once the likelihood of $E_j'$ has been set in step S312 or S314, the process 
continues with step S316, in which the pattern entry set after constraint relaxation is 
altered, $C_{OUT} = C_{OUT} + \{\langle E_j', \mathrm{likelihood}(E_j') \rangle\}$. 

Step S318 determines if the most recent $E_j$ is the last pattern entry $E_j$ within 
$C_{IN}$. If it is not, step S320 increases j by one, i.e. "j = j + 1", and the process reverts to 
step S304 for constraint relaxation of the next pattern entry $E_j$ within $C_{IN}$. 

If $E_j$ is the last pattern entry within $C_{IN}$ at step S318, this represents a valid 
set of pattern entries [$C^1(E_i)$, $C^2(E_i)$ or a further constraint relaxed set, mentioned 
above]. $E_i^0$ is chosen from the valid set of pattern entries at step S322 according to 

$$E_i^0 = \arg\max_{\langle E_j', \mathrm{likelihood}(E_j') \rangle \in C_{OUT}} \mathrm{likelihood}(E_j')$$ 

A determination is made at step S324 as to whether $\mathrm{likelihood}(E_i^0) = 0$. If 
the determination at step S324 is positive (i.e. $\mathrm{likelihood}(E_i^0) = 0$), at step S326 the 
pattern entry set before constraint relaxation and the pattern entry set after constraint 
relaxation are reset, such that $C_{IN} = C_{OUT}$ and $C_{OUT} = \{\}$. The process then reverts to step 
S304, where the algorithm starts going through the pattern entries $E_j'$ as if they were 
$E_j$, within the reset $C_{IN}$, starting at the first pattern entry. If the determination at step S324 
is negative, the algorithm exits the process of Figure 5 and reverts to step S210 of Figure 
4, where the algorithm returns $P(t_i / G_1^n) = P(t_i / E_i^0)$. 

The likelihood of a pattern entry is determined, in step S312, by the number of 
features $f^2$, $f^3$ and $f^4$ in the pattern entry. The rationale comes from the fact that the 
semantic feature of important triggers ($f^2$), the internal gazetteer feature ($f^3$) and the 
external discourse feature ($f^4$) are more informative in determining named entities than 
the internal feature of digitalisation and capitalisation ($f^1$) and the words themselves 
($w$). The number 0.1 is added in the likelihood computation of a pattern entry, in step 
S312, to guarantee that the likelihood is greater than zero if the pattern entry occurs 
frequently. This value can change. 
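
The likelihood measure of steps S312 and S314 can be sketched as a simple count ratio. This is a minimal Python sketch under stated assumptions: the function name `likelihood`, the `Entry` encoding (one feature 4-tuple and one word per position, with `None` for a dropped constraint) and the feature values "G" and "D" are hypothetical placeholders, and the entry is assumed to keep at least one constraint.

```python
from typing import List, Optional, Tuple

Features = Tuple[Optional[str], Optional[str], Optional[str], Optional[str]]
Entry = List[Tuple[Features, Optional[str]]]  # one (feature set, word) per position


def likelihood(entry: Entry, in_dictionary: bool, smoothing: float = 0.1) -> float:
    """Share of the more informative features (f2, f3, f4) among all kept constraints."""
    if not in_dictionary:
        return 0.0  # step S314: entries absent from the dictionary score zero
    # Numerator: kept f2, f3 and f4 features, over every position of the entry.
    informative = sum(1 for feats, _ in entry for x in feats[1:] if x is not None)
    # Denominator: every kept constraint, i.e. f1..f4 and the word itself.
    total = sum(1 for feats, word in entry for x in (*feats, word) if x is not None)
    return (informative + smoothing) / total  # step S312


# One position keeping f1, f3, f4 and the word: (2 + 0.1) / 4.
entry = [(("InitialCap", None, "G", "D"), "Washington")]
score = likelihood(entry, in_dictionary=True)
```

Dropping a word or an $f^1$ feature shrinks only the denominator, so in this sketch such relaxations raise the score, whereas dropping $f^2$, $f^3$ or $f^4$ lowers it.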

An example is the sentence: 

"Mrs. Washington said there were 20 students in her class". 

For simplicity in this example, the window size for the pattern entry is only three 
(instead of five, which is used above) and only the top three pattern entries are kept 
according to their likelihoods. Assume the current word is "Washington", the initial 
pattern entry is $E_2 = g_1 g_2 g_3$, where 

$g_1 = \langle f_1^1 = \text{CapOtherPeriod},\ f_1^2 = \text{PrefixPerson1},\ f_1^3 = \Phi,\ f_1^4 = \Phi,\ w_1 = \text{Mrs.} \rangle$ 
$g_2 = \langle f_2^1 = \text{InitialCap},\ f_2^2 = \Phi,\ f_2^3 = \text{PER1L1},\ f_2^4 = \text{LOC1G1},\ w_2 = \text{Washington} \rangle$ 
$g_3 = \langle f_3^1 = \text{LowerCase},\ f_3^2 = \Phi,\ f_3^3 = \Phi,\ f_3^4 = \Phi,\ w_3 = \text{said} \rangle$ 

First, the algorithm looks up the entry $E_2$ in the FrequentEntryDictionary. If the 
entry is found, the entry $E_2$ is frequently occurring in the training corpus and the entry is 
returned as the optimal frequently occurring pattern entry. However, assuming the entry 
$E_2$ is not found in the FrequentEntryDictionary, the generalisation process begins by relaxing 
the constraints. This is done by dropping one constraint at every iteration. For the entry 
$E_2$, there are nine possible generalised entries since there are nine non-empty constraints. 
However, only six of them are valid according to ValidFeatureForm. Then the 
likelihoods of the six valid entries are computed and only the top three generalised entries 
are kept: $E_2 - w_1$ with a likelihood 0.34, $E_2 - w_2$ with a likelihood 0.34 and $E_2 - w_3$ 
with a likelihood 0.34. The three generalised entries are checked to determine whether 
they exist in the FrequentEntryDictionary. However, assuming none of them is found, 
the above generalisation process continues for each of the three generalised entries. After 
five generalisation processes, there is a generalised entry, $E_2$ with $w_1$, $w_2$, $w_3$ and two 
feature constraints dropped, with the top likelihood 0.5. Assuming this entry is found in 
the FrequentEntryDictionary, the generalised entry is returned as the optimal frequently 
occurring pattern entry with the probability distribution of various NE-chunk tags. 
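
The counts of nine and six in this example can be reproduced with a short enumeration. This is an illustrative reading rather than the embodiment's own procedure, and it rests on assumptions: an unavailable feature is treated as empty (`None`), a relaxation drops exactly one constraint, only the feature set touched by the relaxation is checked against ValidFeatureForm, and the gazetteer and discourse feature values are placeholders. The names `E2`, `VALID_MASKS`, `single_relaxations` and `is_valid` are hypothetical.

```python
# Presence masks over (f1, f2, f3, f4) for the five valid feature forms.
VALID_MASKS = {
    (True, True, True, True),
    (True, False, True, False),
    (True, False, False, True),
    (True, True, False, False),
    (True, False, False, False),
}

# E2 = g1 g2 g3 for "Mrs. Washington said"; None marks an empty feature.
E2 = [
    (("CapOtherPeriod", "PrefixPerson1", None, None), "Mrs."),
    (("InitialCap", None, "gazetteer", "discourse"), "Washington"),
    (("LowerCase", None, None, None), "said"),
]


def single_relaxations(entry):
    """Yield (name, touched feature set or None) for each non-empty constraint."""
    for k, (feats, word) in enumerate(entry, start=1):
        for j, value in enumerate(feats, start=1):
            if value is not None:
                # Drop feature j of position k; the name f{k}^{j} reads f_k^j.
                yield f"f{k}^{j}", feats[: j - 1] + (None,) + feats[j:]
        if word is not None:
            # Dropping the word leaves the feature set untouched.
            yield f"w{k}", None


def is_valid(touched) -> bool:
    """Accept word drops, and feature drops that leave a valid feature form."""
    return touched is None or tuple(x is not None for x in touched) in VALID_MASKS


candidates = list(single_relaxations(E2))
valid = [name for name, touched in candidates if is_valid(touched)]
```

Under these assumptions `candidates` has nine members, and the three that drop an $f^1$ feature are rejected, leaving six.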

Pattern Induction 

The present embodiment induces a pattern dictionary of reasonable size, in which 
most if not every pattern entry frequently occurs, with related probability distributions of 
various NE-chunk tags, for use with the above back-off modelling approach. The entries 
in the dictionary are preferably general enough to cover previously unseen or less 
frequently seen instances, but at the same time constrained tightly enough to avoid 
over-generalisation. This pattern induction is used to train the back-off model. 

The initial pattern dictionary can be easily created from a training corpus. 
However, it is likely that most of the entries do not occur frequently and therefore cannot 
be used to estimate the probability distribution of various NE-chunk tags reliably. The 
embodiment gradually relaxes the constraints on these initial entries, to broaden their 
coverage, while merging similar entries to form a more compact pattern dictionary. The 
entries in the final pattern dictionary are generalised where possible within a given 
similarity threshold. 

The system finds useful generalisations of the initial entries by locating and 
comparing entries that are similar. This is done by iteratively generalising the least 
frequently occurring entry in the pattern dictionary. Given the large number of ways 
in which the constraints could be relaxed, there is an exponential number of 
generalisations possible for a given entry. The challenge is how to produce a near 
optimal pattern dictionary while avoiding intractability and maintaining a rich 
expressiveness of its entries. The approach used is similar to that used in the back-off 
modelling. Three restrictions are applied in this embodiment to keep the generalisation 
process tractable and manageable: 

(1) Generalisation is done through iteratively moving up the semantic 
hierarchy of a constraint. A constraint is dropped entirely from the entry 
when the root of the semantic hierarchy is reached. 

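(2) The entry after generalisation should have a valid form, defined as 
ValidEntryForm = { $f_{i-2}f_{i-1}f_i w_i$, $f_{i-1}f_i w_i f_{i+1}$, 
$f_i w_i f_{i+1}f_{i+2}$, $f_{i-1}f_i w_i$, $f_i w_i f_{i+1}$, 
$f_{i-1}w_{i-1}f_i$, $f_i f_{i+1}w_{i+1}$, $f_{i-2}f_{i-1}f_i$, 
$f_{i-1}f_i f_{i+1}$, $f_i f_{i+1}f_{i+2}$, $f_i w_i$, $f_{i-1}f_i$, 
$f_i f_{i+1}$, $f_i$ }. 

(3) Each $f_k$ in the entry after generalisation should have a valid feature form, 
defined as ValidFeatureForm = { $\langle f_k^1, f_k^2, f_k^3, f_k^4 \rangle$, 
$\langle f_k^1, \Theta, f_k^3, \Theta \rangle$, $\langle f_k^1, \Theta, \Theta, f_k^4 \rangle$, 
$\langle f_k^1, f_k^2, \Theta, \Theta \rangle$, $\langle f_k^1, \Theta, \Theta, \Theta \rangle$ }, 
where $\Theta$ means such a feature is dropped or is not available. 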

The pattern induction algorithm reduces the apparently intractable problem of 
constraint relaxation to the easier problem of finding an optimal set of similar entries. 
The pattern induction algorithm automatically determines and exactly relaxes the 
constraint that allows the least frequently occurring entry to be unified with a set of 
similar entries. Relaxing the constraint to unify an entry with a set of similar entries has 
the effect of retaining the information shared with a set of entries and dropping the 
difference. The algorithm terminates when the frequency of every entry in the pattern 
dictionary is bigger than some threshold (e.g. 10). 

The process for pattern induction is discussed below with reference to the 
flowchart in Figure 6. 

The process of Figure 6 starts, at step S402, with initialising the pattern dictionary. 
Although this step is shown as occurring immediately before pattern induction, it can be 
done separately and independently beforehand. 

The least frequently occurring entry $E$ in the dictionary, with a frequency below a 
predetermined level, e.g. < 10, is found in step S404. The constraint $E^i$ (which in the 
first iteration of step S406 for any entry is the first constraint) in the current entry $E$ is 
relaxed one step, at step S406, such that $E'$ becomes the proposed pattern entry. Step 
S408 determines if the proposed constraint relaxed pattern entry $E'$ is in a valid entry 
form in ValidEntryForm. If the proposed constraint relaxed pattern entry $E'$ is not in a 
valid entry form, the algorithm reverts to step S406, where the same constraint $E^i$ is 
relaxed one step further. If the proposed constraint relaxed pattern entry is in a valid 
entry form, the algorithm proceeds to step S410. Step S410 determines if the relaxed 
constraint $E^i$ is in a valid feature form in ValidFeatureForm. If the relaxed constraint 
$E^i$ is not valid, the algorithm reverts to step S406, where the same constraint is 
relaxed one step further. If the relaxed constraint is valid, the algorithm proceeds to 
step S412. 

Step S412 determines if the current constraint is the last one within the current 
entry $E$. If the current constraint is not the last one within the current entry $E$, the process 
passes to step S414, where the current level "i" is increased by one, i.e. "i = i + 1", after 
which the process reverts to step S406, where a new current constraint is relaxed a first 
level. 

If the current constraint is determined as being the last one within the current entry 
$E$ at step S412, there is now a complete set of relaxed entries $C(E^i)$, which can be unified 
with $E$ by relaxation of $E^i$. The process proceeds to step S416, where for every entry $E'$ 
in $C(E^i)$, the algorithm computes $\mathrm{Similarity}(E, E')$, which is the similarity between 
$E$ and $E'$, using their NE-chunk tag probability distributions. 

In step S418, the similarity between $E$ and $C(E^i)$ is set as the least similarity between 
$E$ and any entry $E'$ in $C(E^i)$: $\mathrm{Similarity}(E, C(E^i)) = \min_{E' \in C(E^i)} \mathrm{Similarity}(E, E')$. 

In step S420, the process also determines the constraint $E^0$ in $E$, of any possible 
constraint $E^i$, which maximises the similarity between $E$ and $C(E^i)$: 
$E^0 = \arg\max_{E^i} \mathrm{Similarity}(E, C(E^i))$. In step S422, the process creates a new entry $U$ in 
the dictionary, with the constraint $E^0$ just relaxed, to unify the entry $E$ and every entry in 
$C(E^0)$, and computes entry $U$'s NE-chunk tag probability distribution. The entry $E$ and 
every entry in $C(E^0)$ are deleted from the dictionary in step S424. 

At step S426, the process determines if there is any entry in the dictionary with a 
frequency of less than the threshold, in this embodiment less than 10. If there is no such 
entry, the process ends. If there is an entry in the dictionary with a frequency of less than 
the threshold, the process reverts to step S404, where the generalisation process starts 
again for the next infrequent entry. 
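
The loop of Figure 6 can be sketched in simplified form. This is a minimal Python sketch under stated assumptions, not the embodiment's implementation: an entry is a frozenset of (slot, value) constraints with at most one constraint per slot, relaxing a constraint drops it outright instead of moving up a semantic hierarchy, the validity checks of steps S408 and S410 are omitted, and the similarity between two entries is taken to be the cosine of their NE-chunk tag count vectors, which is an assumed measure. The names `cosine` and `induce` are hypothetical.

```python
import math
from collections import Counter


def cosine(p: Counter, q: Counter) -> float:
    """Cosine similarity between two tag-count vectors (an assumed measure)."""
    dot = sum(p[t] * q[t] for t in p)
    norm = math.sqrt(sum(v * v for v in p.values()) * sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0


def induce(dictionary: dict, threshold: int = 10) -> dict:
    """Merge infrequent entries until every non-empty entry reaches the threshold."""
    d = {e: Counter(c) for e, c in dictionary.items()}
    while True:
        # Steps S404 and S426: find the infrequent entries that can still be relaxed.
        rare = [e for e in d if e and sum(d[e].values()) < threshold]
        if not rare:
            return d
        E = min(rare, key=lambda e: sum(d[e].values()))
        best = None  # (similarity, unified entry U, set C of entries unified with E)
        for constraint in sorted(E):  # steps S406 to S414: try each constraint
            U = E - {constraint}
            slot = constraint[0]
            # Entries that coincide with E once the same slot is relaxed.
            C = [F for F in d if F != E
                 and frozenset(c for c in F if c[0] != slot) == U]
            # Steps S416 and S418: the least similarity between E and any entry in C.
            sim = min((cosine(d[E], d[F]) for F in C), default=-1.0)
            if best is None or sim > best[0]:  # step S420
                best = (sim, U, C)
        _, U, C = best
        # Steps S422 and S424: create U from E and C, deleting the originals.
        d[U] = sum((d.pop(F) for F in [E, *C]), Counter())


initial = {
    frozenset({("f", "Cap"), ("w", "Mr")}): Counter({"PER": 6}),
    frozenset({("f", "Cap"), ("w", "Mrs")}): Counter({"PER": 5}),
    frozenset({("f", "Low"), ("w", "said")}): Counter({"O": 12}),
}
result = induce(initial, threshold=10)
```

In this toy run the two infrequent entries differ only in their word, so relaxing the word constraint unifies them into a single entry with eleven occurrences, while the frequent entry is left unchanged.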

In contrast with existing systems, each of the internal and external features, 
including the internal semantic features of important triggers and the external discourse 
features and the words themselves, is structured hierarchically. 

The described embodiment provides effective integration of various internal and 
external features in a machine learning-based system. The described embodiment also 
provides a pattern induction algorithm and an effective back-off modelling approach by 
constraint relaxation in dealing with the data sparseness problem in a rich feature space. 

This embodiment presents a Hidden Markov Model, a machine learning approach, 
and proposes a named entity recognition system based on the Hidden Markov Model. 
Through the Hidden Markov Model, with a pattern induction algorithm and an effective 
back-off modelling approach by constraint relaxation to deal with the data sparseness 
problem, the system is able to apply and integrate various types of internal and external 
evidence effectively. Besides the words themselves, four types of evidence are explored: 
1) simple deterministic internal features of the words, such as capitalisation and 
digitalisation; 2) unique and effective internal semantic features of important trigger 
words; 3) internal gazetteer features, which determine whether and how the current word 
string appears in the provided gazetteer list; and 4) unique and effective external 
discourse features, which deal with the phenomenon of name aliases. Moreover, each of 
the internal and external features, including the words themselves, is organised 
hierarchically to deal with the data sparseness problem. In such a way, the named entity 
recognition problem is resolved effectively. 

In the above description, various components of the system of Figure 1 are 
described as modules. A module, and in particular its functionality, can be implemented 
in either hardware or software. In the software sense, a module is a process, program, or 
portion thereof, that usually performs a particular function or related functions. In the 
hardware sense, a module is a functional hardware unit designed for use with other 
components or modules. For example, a module may be implemented using discrete 
electronic components, or it can form a portion of an entire electronic circuit such as an 
Application Specific Integrated Circuit (ASIC). Numerous other possibilities exist. 
Those skilled in the art will appreciate that the system can also be implemented as a 
combination of hardware and software modules. 



